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Abstract. We perform a smoothed analysis of the condition number of rect- 
angular matrices. We prove that, asymptotically, the expected value of this 
condition number depends only of the elongation of the matrix, and not on the 
center and variance of the underlying probability distribution. 

1 Introduction 

The most widely used extension to rectangular matrices of the notion of inverse 
of square matrices is the so called Moore-Penrose inverse. For a full rank matrix 
A e R mxn this is defined as A* := (A T A)~ 1 A T if m > n, and as At := A T (AA T )~ 1 , 
otherwise. Immediate applications of A^ include the solution of least square prob- 
lems 

minPx-6|| 2 , (1) 

with b G M m and m > n, or of smallest solutions of underdetermined systems 

min ||x|| 2 (2) 
x\Ax=b 

when n > m. In both cases, the solution is given by x = A^b. Well known re- 
sults in error analysis show that the accuracy in the computation of A\ or in the 
computation of the solution x for the problems above, crucially depends on the 
condition number k(A) := \\A\\ \\A'\\ of A, where \\A\\ denotes the spectral norm 
(see [14, Ch. 19]). Accuracy analysis is not the only source of interest in k(A). Al- 
gorithms such as the conjugate gradient method produce approximate solutions of 
linear systems Px = c — here P G ]j mxm j s a positive definite matrix and c G M m — 
with a number of iterations proportional to W n(P) and, in many cases, the matrix 
P has been obtained as P = AA T for some matrix A G M. mxn . In those cases, 
■sj k(P) = k(A) and one is again interested in the latter, this time by complexity 
considerations. 
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The condition number k(A) is not directly readable from A, and its computation 
seems to require that of A^ . This is a common situation in numerical analysis. A 
way out of it, proposed as early as 1951 by von Neumann and Goldstine [17] and 
more recently pioneered by Demmel [6] and Smale [20], consists of randomizing 
the matrix A — say, by endowing M. mxn with a multivariate standard Gaussian 
distribution N(0, 1) — and considering its condition number as a derived random 
variable. 

In Chen and Dongarra [4] the following tail estimates on k(A) were shown for 
A G W nxn with n> m: for x > n — m + 1 we have 

1 / 1 \ n—m+l ( x 1 1 / 7 \n-m+l 

V2vr V5x/ a~jv(o,i) I 1 — A J y/2n\xJ w 

Moreover, the expectation E(«(-A)) can be bounded as a function of the elongation 
only, independently of n. (We remark that this is not true for Demmel's scaled 
condition number ||^||, compare [9].) More precisely, for a sequence (m n ) of 

integers such that lim n ^oo m n /n = A € (0, 1) and a sequence of standard Gaussian 
random matrices A n G W mnXn , we have in almost sure convergence 

This follows from Geman [10] and Silverstein [19] (see Edelman [8] for more precise 
results). 

The above results provide theoretical reasons of why least squares problems 
such as (1) or underdetermined systems such as (2) are solved to great accuracy 
or why the conjugate gradient method is so efficient in practice. In fact, it follows 
from (4) that the expected number of iterations of the conjugate gradient method 
on the random input P = AA T remains bounded in terms of the elongation m/n as 
n — > co and A G M. mxn is standard Gaussian. Our main result stated below implies 
that this phenomenon is still true for any matrix that is only slightly perturbed. 

The choice of N(0, 1) as underlying data distribution is pervasive in the average- 
case analysis of condition numbers (and other quantities occurring in numerical 
analysis) . It has the virtue of simplicity as a first approach to understanding which 
condition numbers one may expect. But it has been criticized due to the loose 
relationship of the Gaussian iV(0,I) to the measures that may be governing data 
drawing in practice. In particular, it has been observed that the use of Gaussians 
may be 'optimistic' in the sense that they may put more probability mass on the 
instances where the values of the function ip under consideration are small. Such an 
optimism would produce yield an expectation E(Y>) smaller than the true one. 

An alternate, more conservative, form of analysis has been proposed by Spielman 
and Teng under the name of smoothed analysis. It replaces the Gaussian measure 
N(0, 1) by the measures N(A, a 2 Y) where A is arbitrary. The idea is then to re- 
place the unlikely 'average data' by a (usually small) perturbation of any possible 
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occurring data. The rationale for this form of analysis is offered in a number of 
papers [21, 18, 22, 23] and we won't repeat it here in full. We note, nonetheless, 
that the local nature of randomization in smoothed analysis, coupled with its worst- 
case dependence on the input data, removes from smoothed analysis the possible 
optimism we mentioned above for average-case analysis. In recent years, different 
aspects of algorithm behavior for a variety of problems have been analyzed this 
way. These include condition numbers of square matrices with real [27] or {—1, 1} 
coefficients [24], complexity of interior-point methods [7], and machine learning [1]. 
The typical satisfying result is polynomial smoothed complexity (see [23, Def. 2]), 
consisting of a bound of the form 

sup E il>{A) < co-- kl s\ze(A) k2 (5) 

A A~N(A,a 2 i) 

where ip is the function whose behavior we are analyzing and c,k±,k2 are positive 
constants. 

In this paper we provide a smoothed analysis for Moore-Penrose inversion, ex- 
tending (3) from the average-case analysis to smoothed analysis. To state the results 
we need to introduce some notations. We assume 1 < m < n throughout the paper. 
For a standard Gaussian X £ R mxn we put 

Q( m ,„):=-L E (||X||). (6) 
v ^ 

(Lemma 2.4 shows that Q(m,n) < 6 .) We define for A G (0, 1) the quantity 



Note that c(A) is monotonically increasing, lim^o C (A) = -j= and lim^i c(A) = oo. 

Further, for 1 < m < n and < a < 1, we define the elongation A := m= ^- and 
introduce the quantity 

( a (m,n) := (Q(m,n) H -=) c(A)»-™+i. (8) 

Our main result is the following tail bound on the condition number of rectan- 
gular matrices under local Gaussian perturbations. 

Theorem 1.1 Suppose that A G M mxn satisfies \\A\\ < 1 and let < a < 1. Put 
A := Then, for z > ( a (m,n), we have 

{r -i -1 -| n—m+1 
k(A)>-^-\ < 2c(A) (Q(m,n) + V21n(2z) + ^) 
., 1 - A J V oVn/ 
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Remark 1.2 1. The decay in z in this tail bound is the same as in (3) up to the 
logarithmic factor \/ln z. We believe that the latter is an artefact of our proof that 
could be omitted. In fact, the exponent n — m + 1 is just the codimension of the 
set E := {A £ flj mx « | ^ m j Q £ ran k deficient matrices, cf. [12]. Moreover, 
it is known [14] that ||Af|| = l/dist(A, S) where the distance is measured in the 
Euclidean norm. From the interpretation of Prob{K(A) > t} as the volume of a 
tube around S, as discussed in [2], one would therefore expect a decay of order 
1 J z n ~ m +^ 

2. When a = 1 and A = 0, Theorem 1.1 yields tail bounds for the usual average 
case. One may therefore compare these bounds with (3). In doing so, we see that 
the bound in Theorem 1.1 has the additional factor c(A) (going to oo for A — > 1). 
However, we note that the bound (3) holds only for x = ez > n — m + 1, while our 
bound holds for any z > (aim, n). Furthermore, if we fix A € (0, 1) and let (m n ) be 
a sequence of positive integers such that lim m n /n = A, it follows from [10] that 

lim Q(m n , n) = 1 + VX. 

n— >oo 

This implies that hin^^oo (, a (yn n , n) = 1 + y/X for fixed a € (0, 1] and, in particular, 
that C(j(m n ,n) < 2 for sufficiently large n . That is, for large n, the tail bound in 
Theorem 1.1 is valid for any z > 2. 

Theorem 1.1 easily implies the following bound on expectations. 

Corollary 1.3 For all Xq G (0, 1) there exists no such that for all 1 < m < n such 
that X = ^—1 < An and n > no we have for all a with — 5= < a < 1, and an 1 
A G R^x™ || A|| < 1 ; that 

, / aw ^ 20.1 
E (k(A)) < -. 

As for the average-case analysis, this bound is independent of n and depends only 
on the bound An on the elongation. Thus we have a bound of type (5) with ki = 0. 
Surprisingly, the smoothed complexity bound in Corollary 1.3 is also independent 
of a. We thus add reasons — and we will become more specific in Section 4 — to the 
current understanding of the accuracy in least squares or underdetermined system 
solving or the complexity of the conjugate gradient method. 

A first approach to the smoothed analysis of Moore-Penrose inversion appears 
in [5] . The bounds obtained in that paper are worse by an order of magnitude than 
those we obtain here. In Section 5 we compare these bounds with ours as well as with 
actual averages obtained, for specific values of n, m and a, in numerical simulations. 

Our proof techniques are an extension of methods employed by Sankar et al. [18]. 

Acknowledgements. This work was carried out during the special semester on 
Foundations of Computational Mathematics in the fall of 2009. We thank the Fields 
Institute in Toronto for hospitality and financial support. 
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2 Preliminaries 



2.1 Some definitions and notation 

The spectral norm of a matrix A G W nxn is defined as ||A|| := sup|j :r || =1 ||^4x||, 
where ||x|| denotes the Euclidean norm. The Frobenius norm of A is defined as the 
Euclidean norm of A when interpreted as a vector. 

Suppose that A G W nxn is of maximal rank and m < n. The Moore-Penrose 
inverse of ^4 is defined as A^ := A T (AA T )~ 1 G M nxm . It can also be characterized 
as follows. For any v G M m the vector w = A'v is orthogonal to the kernel of A and 
satisfies Aw = v. The condition number k(A) is defined as k(A) := \\A\\ ■ \\A^\\. 

Let A G ]R mxn and o" > 0. The isotropic normal distribution N(A,aI) with 
center A and covariance matrix cr 2 I is the probability distribution on ]R mxn with 
the density 

I _||A-A||J, 

PA,a( A ) := f0 ,g e ^ • 
(27TJ 2 

A 

Lemma 2.1 For A G (0, 1) we have A X - A < e. 

Proof. Writing u = 1/A the assertion is equivalent to u u - 1 < e or u < e u ~ , 
which is certainly true for u > 1. □ 



2.2 Concentration on spheres 

Let S m_1 := {x G M m | ||x|| = 1} denote the unit sphere in W 71 . We denote by m _i 
its volume, which is given by O m _i = 27r m / 2 /T(J%). 

The following estimate tells us how likely a random point on S m_1 will lie in a 
fixed spherical cap. 

Lemma 2.2 Let it G S™ -1 be fixed, m>2. Then, for all £ G [0, 1], 



Prob \\u T v\ > > J— (l-£ 2 )^. 

^[/(S™- 1 ) V vrm 

Proof. We put = arccos£ and let cap(u, 0) denote the spherical cap in § m_1 
with center u and angular radius 9. Using the bounds in Lemmas 2.1 and 2.2 of [3] 

we get 



;| T , . 2volcap(n,g) 20 m „ 2 (1 - f 
Prob \ \u v\ > t\ = — = — > 



m— 1 



i,~t/(§™-i) 11 1 J volS™- 1 ~ O m _i (m-1) 

Using the formula for O m -\ and the recursion Y(x + 1) = xY{x) we have 

o m - 2 _ i r(f) i r(*fi) r(f) m -i r(f) 
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The assertion follows now from the estimate 

r(™+!) " Vm- 1 j 

This estimate can be quickly seen as follows. Suppose that Z G R m is standard 
normal distributed. Using polar coordinates and the variable transformation u = 
p 2 /2 we get 

E(l|z|l) - (SytX " e ^r 2 2 i, " 16 * 

= |^2^r(^l±i) _ v^^. (io) 

(2vr) 2 2 T( T j 

where we used the definition of the Gamma function for the second last equality. 
To complete the proof of (9) we note that E(\\Z\\) < y / E(PF) = \fm. □ 

For later use we note that (10) implies 

r(n^i) r(^)r(^) m r(^) 



> 



m 



r (f) r (f) r ( m j 1 ) 2 r (^) ~ 2 Vm + 1' 
using (9) for the right-hand inequality. Therefore 



m 



E(||Z||) > -_=. (11) 
V^t + 1 

2.3 Large deviations 

We will use a powerful large deviation result. Let F : 1^ — > R be a Lipschitz 
continous function with Lipschitz constant L, so that \F(x) — F(y)\ < L\\x — y\\ for 
all x, y € M. N , where || || denotes the Euclidean norm. Now suppose that x £ R^ is 
a standard Gaussian random vector such that ~E(F(x)) exists. Then it is known [16, 
(1.4)] that for all t > 

t 2 

Prob{F(x) > E(F) + t} < . (12) 

(We note that in [16, (1.4)] this is only stated for the median, but the inequality 
holds as well for the expectation. See also [15].) 

2.4 A bound on the expected spectral norm 

The function W nxn — > R mapping a matrix X to its spectral norm ||X|| is Lipschitz 
continuous with Lipschitz constant 1, as ||X — Y\\ < \\X — Y\\p. The concentration 
bound (12), together with (6), implies that for t > 0, 

Probj ||X|| > Q(m,n)y/n + t} < e~~. (13) 
This tail bound easily implies the following large deviation result. 
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Proposition 2.3 Let A G R mxn withm< n, \\A\\ < 1, and a G (0,1]. If ,4 G M mx ™ 
foliows tiie law N(A, a 2 l), then, for t > 0, 

Prob \\\A\\ > Q(m,n)a^ + t + l\ < e"^. 

PROOF. We note that \A\ > Q(m,n)ay/n + t + 1 implies that ||A - A\\ > 
\\A\\ - \\A\\ > Q(m,n)^/n + t. Moreover, if A G R mxn follows the law N(A,a 2 I), 

then X := is standard Gaussian in M. mxn . The assertion follows from (13). 

□ 

We derive now an upper bound on Q(m,n). Such result should be well-known 
but we could not locate in the literature. 

Lemma 2.4 For n > 1 we have < Q{m, n) < 2^1 + ^ 21n (2m-i) + < g 

The proof relies on the following lemma. 

Lemma 2.5 Let n, . . . , r n be independent random variables with nonnegative val- 
ues such that rf is x 2 -distributed with fi degrees of freedom. Then, 

E ( max r; ) < max \fYi + V2 In n + 1. 

V l<i<n / l<i<n 

Proof. We start by a large deviation estimate for x 2 -distributed random vari- 
ables. Note that — >■ M, x i— >■ ||x||, is Lipschitz continuous with Lipschitz con- 
stant 1. From (12) we know that for standard Gaussian x G W 1 and all t > 0, 

Prob{||a;|| > E(||x||) +t} < . 
Since E(||x||) < yWPF) = v 7 /, this implies for all t > 0, 

Prob{||x|| > 77 + *} < e - ^. (14) 

We suppose now that T\,...,r n are independent random variables with non- 
negative values such that rf is ^-distributed with fi degrees of freedom. Put 
/ := maxj fi. Equation (14) tells us that for all i and all t > 0, 

i— t 2 
Prob{ri > Vf + t} < e~~ 

and hence, by the union bound, 

Prob < max r i > \ ff + t \ < ne~~2~ . 
I l<i<n v J 
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For a fixed parameter b > 1 (to be determined later), this implies 

E( max n) < \f] + b+ I Prob{ max r { > T} dT 

l<i<n ' JyfJ+b l<i<n 

= V r f + b+ Prob{ max n> ^f] + t}dt 

Jb l<i<n 

r- f°° t 2 

< Vf + b + nJ e'^dt. 



Using the well-known estimate 
1 



2vr J b 



e~- dt < 



1 



bV2ir 



e 2 < 



e 2 



we obtain 



b 

E( max rj) < -y// + b + ne~~. 

Ki<n 



Finally, choosing b := a/2 Inn we get 



E( max rj) < a/7 + V^2 Inn + 1, 

KKi! 



as claimed. 



□ 



PROOF of Lemma 2.4. A general matrix X G ]j mxn can be transformed into a 
bidiagonal matrix of the form 



Y := 



Wm-l Vn-1 



••• 



wi v n - m+1 ••• 



with Vi,Wj > by performing Householder transformations from the left and right 
hand side of X, cf. [11, §5.4.3]. In particular, ||X|| = ||y||. An analysis of this 
transformation shows that if we start with a standard Gaussian matrix X, then the 
v n , . . . , v n - m+ i, w m -i, . . . , wi are independent random variables such that vf and 
wf are x 2 -distributed with i degrees of freedom, cf. [19]. 

The spectral norm of Y is bounded by maxj V{ + maxj Wj < 2r, where r denotes 
the maximum of the values Vi and Wj. Lemma 2.5 implies that, for n > 1, 

E(r) < aA + V 2 H 2m - 1) + 1 < 3aA- 

This shows the claimed upper bound on Q(m, n). For the lower bound we note that 
||Y|| > \v n \ which gives E(||Y||) > E(|v n |)- The claimed lower bound now follows 
from (11), which states that E(|u n |) > \J^\- a 
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3 Proof of the main results 



The main work consists of deriving tail bounds on ||-AT||, which is done in the next 
subsection. 

3.1 Tail bounds for pt|| 

Proposition 3.1 Let A G M m ><™, a > 0, and put A := 2*zl. For random A ~ 
N(A, cr 2 I) we have, for any t > 0, 



Prob illA+ll > — ^— ) < c(X)(—^—\ 



(l-A)n 

^P f ll>^!> < c(A)(-^] 

A~N(A,a 2 I) 

We first show the following result. 
Proposition 3.2 For all v G § m ~\ A G M mxn , a > 0, and £ > we have 

Prob {PM|>£} < _J 

Proof. We first claim that, because of unitary invariance, we may assume that 
v = e m := (0, . . . , 0, 1). To see this, take $€ [/ (m) such that w = fe m . Consider the 
isometric map A^ B = &~ 1 A which transforms the density p^ a {A) into a density 
of the same form, namely p^-i-^ a (B). Thus the assertion for e m and random B 
implies the assertion for v and A, noting that A'v = B'e m . This proves the claim. 

We are going to characterize the norm of w := A*e m in a geometric way. Let aj 
denote the ith row of A. Almost surely, the rows cti, . . . , a m are linearly independent; 
hence, we assume so in what follows. Let 

R := span{ai, . . . , a m }, S := span{ai, . . . , a m _i}. 

Let denote the orthogonal complement of S in W 1 . We decompose a m = a^+a^, 
where denotes the orthogonal projection of a m onto S 1 - and G 5. Then 
o^eiJ since both a m and are in i?. It follows that G R PI S^. 

We claim that to 6 fin S'-'- as well. Indeed, note that i? equals the orthogonal 
complement of the kernel of A in K n . Therefore, by definition of the Moore- Penrose 
inverse, w = A*e m lies in R. Moreover, since AA* = I, we have (w, ai) = for 
i = 1, . . . , m — 1 and hence weS 1 as well. 

It is immediate to see that dim R n = 1 . It then follows that R n = Rw = 
Ra],. Since (w,a m ) = 1, we get 1 = (w,a m ) = (w,a^) = \\w\\ \\a^\\ and therefore 

ll^emll = (15) 

ll a rall 

Let A m G 1 ) xra denote the matrix obtained from A by omitting a m . The 
density p-^ a factors as p-^ a (A) = pi(A n )p 2 (a n ) where p\ and p 2 denote the density 
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functions of N(A m ,a 2 l) and N(a m ,a 2 l), respectively (the meaning of A m and a Ti 
being clear). Fubini's Theorem combined with (15) yield, for £ > 0, 



3b {pt em || >£} = / P^^eM (16) 

^ 2 I) ■/||Ate m ||>C 

Pi (-4 m) ' I / P2( a m) da m I dA m . 

\J\\a,±\\<l/£ / 



/ 

J A, 



J 

J\\a 



lArneRi™- 1 )*™ \J\\a^\\<l/^ 

To complete the proof it is sufficient to show the bound 

p 2 (a m ) cfa m < — — — (17) 

a ± ||< i (V2^) n - m+1 n - m + 1 Va£/ 

for fixed, linearly independent ai, . . . , a m _i and £ > 0. 

To show (17) note that ~ N(a^,a 2 l) in S"- 1 ~ l n_m+1 where is the 
orthogonal projection of a m onto S- 1 -. Let B r denote the ball of radius r in W 
centered at the origin. It is easy to see that vol B r = O p -\r v '/p. For any ieK p and 
any a > we have 

2 



1 f iixir 

Prob {llxll < e) < Prob {llxll < e) = ; / e ~z^ T dx 

-JV(x,<r»I) L x~7V(0,a2l) ' (cr V / 2^r) p 7||a;||<e 



a;~iV(S,cr 2 I) 

:r=o-z 



= — / e " 2 cfe 
2tt)p 7|| 2 



< " vol ff^ = , f-V vol^i 



(v 7 ^^""'^" ~ (v 7 ^)* 5 ^ 
— f- 



(V2^)p\ctJ p 

Taking x = a^, e = |, and p = n — m + 1 the claim (17) follows. □ 

Proof of Proposition 3.1. The proof is based on an idea in [18]. For A G 
M mxn there exists u A G § m_1 such that \\A^\\ = H^uaII- Moreover, for almost 
all A, the vector ua is uniquely determined up to sign. Using the singular value 
decomposition it is easy to show that, for all v G S" 1 ^ 1 , 

\\AK\\>\\A^\\.\u T A v\. (18) 

Now take A ~ N(A,a 2 l) and v ~ [/(S" 1-1 ) independently. Then, for any 
s G (0, 1) and t > we have 

Problp^H > ty/l - s 2 } > Prob (pt || >tk \u%v\ > \A - s 2 \ 

A,v J A,v I ' J 

= Prob { || ^ || > t} -Probjl^l > \A - s 2 | pt|| > t} 

>PK>b{Mt||>t}.yX a m-l > 
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the last line by Lemma 2.2 with £ = y/l — s 2 . Now we use Proposition 3.2 with 
£ = ty/l — s 2 to deduce that 



Prob{pt||>t} < ^-^ftobOAM^tv^ 5 } (19) 

< 1 On- m / 1 \n-m+l 



2s m ~ l (^)n-m n - m+ \ \ at ^fl^2 



We next choose s € (0, 1) to minimize the bound above. To do so amounts to 
maximize (1 — x) 2 x 2 where x = s G (0, 1), or yet, to maximize 



/ n — m + 1 m— 1 \ n n — m + 1 m— 1 -i \ \ 

= I (1 — X) 2 X 2 \ = (1 — X) n X n = (1 — X) X . 

We have ^ ln^(x) = | — with the only zero attained at x* = A. 
Replacing s 2 by A in (19) we obtain the bound 

Prob{pT|| >t}< 1 



A L " 11 - J - 2A 4f (v 7 ^)™^™ (1 - A)n VfJtv 7 !^ 
Lemma 2.1 implies 

An / A \ (l-A)n (l-A)n 

A~— = f A aa-A) J < e^2— . 

So we get 

x/A^ + T 1 C»n-m / \ (1_A)n 



A 111 11 - / - 2 (v 7 ^)™"™ (1- A)n Vat\/r^A 
VA^+T / e \ t 1 O ra - m / J_\ (1 " A)n 



1 - Ay (y/2ir) n - m (1 - A)n \at J 

-A)V n^U-V (v^7r) n " m 



VT+T 1 ^ e \ 2 2vr^2— ( ^- 



2(1 -A) V^V 1 -^; r(^f±i)(\/2^)«- m \ot, 

VT+X 1 / e \^ v^F /l\ (1 " A)n 



1-A V^V 1 -^/ T( n( - 1 ~ x) )2 {J ^r 21 ^ at . 

We next estimate r( ^~ 2 ^ n ). To do so, recall Stirling's bound 

V2irx x+ ^e~ x < T(x + 1) < v^ar^e^+i^ for all x > 
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which yields, using T(x + 1) = xT(x), the bound T(x) > \/2it/x (x/e) x . We use this 



with x = ^ ^ n to obtain 



r( - (l-A)n \ > / 4vr ({l-\)n 



(l-A)re 

jn\ 2 

(l-A)n V 2e / 
Plugging this into the above we obtain (observe the crucial cancellation of y/n) 
PrpblP 1 "!! > t} 



(1 - A) 2 y/n \1 - X J V 4vr \(l-X)nJ \at / 

= c(A) t^t (-) 2 ( ~~T ) =c(A) 



1 — A/ Vn/ V ^/ \CJy / n(l — X)t / 

which completes the proof of the proposition. □ 



3.2 Proof of Theorem 1.1 

To simplify notation we write c := c(A) and Q := Q(m,n). Proposition 3.1 implies 
that for any e > we have 

Prob {ll^T^^f-) 11 ^} < e. (20) 

- * 2 

Similarly, letting e = e in Proposition 2.3 and solving for t we deduce that, for 
any e G (0, 1], 

Prob{||A|| > Q(Ty/n + (T^2]n^ + l} < e. (21) 

We conclude that 

Prob \k{A) > < 2e, (22) 

A~At(A,(t 2 I) L 1 - A J 

where we have have set, for e £ (0,1], 

*):- ( + yiq + -i=)(^. ( 23 , 

We note that z(l) = £ := C<r( m ) n )i c f- Equation (8). Moreover, um e _>o -2(e) = co and 
z is decreasing in the interval (0, 1]. Hence, for z > (, there exists e = e{z) G (0, 1] 
such that 2 = z(e). 
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We need to upper bound e(z) as a function of z. To do so, we start with a weak 
lower bound on e{z) and claim that 



- In- < ln(2z(e)). 

n e 



(24) 



To show this, recall that Q > yj^fi > ^ due to Lemma 2.4. Hence ( > Q > l/y/2 
and it follows that \[2z < 1 for z > (. Thus, Equation (23) implies that 



— 



Using c > -j= we get 



(V2z) n > (V2z) (1_A)n > - > 



Hence {2z) n > 1/e, which shows the claimed inequality (24). 

Using the bound (24) in Equation (23) we get, again writing z = z(e), that 



which means 



e < c 



fQ + ^21n(2z) + ^) 1 



yjn) z_ 



(l-A)r 



By (22) this completes the proof. 



□ 



3.3 Proof of Corollary 1.3 

Fix Ao £ (0, 1) and put c := c(Ao). Suppose that m < n satisfy A = (m — l)/n < Ao- 

Then n — m+1 = (1 — A)n > (1 — Xo)n and in order to have n — m sufficiently large it 

i 

suffices to require that n is sufficiently large. Thus, c n - m +! < 1.1 if n is sufficiently 
large. Similarly, because of Lemma 2.4, Q(m,n) < 2.1 for large enough n. This 
implies that, for A= < a < 1, we have 



I i m / 1 

Q(m,n) + — f= < 2.1 + — = < 2.1 + W— < 2.1 + a/A + - < 3.1, 
G\Jn a^Jn \ n V n 

provided n is large enough. Then ( a (m,n) < 3.1 • 1.1 = 3.41. 

By Theorem 1.1, the random variable Z := (1 — X)n(A)/e satisfies, for any A 
with \\A\\ < 1 and any z > 3.41, 

n—m+l 

Prob {Z • :! < 2c 

A~N(A,a 2 I) 



< 2c 



( Q(m, n) + y / 21n(2z) + — - 
V " o^JnJ z 

(3.1 + V21n(2.)) - 
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Since 3.1 + -y/21n(2z) < e^fz for z > 4 we deduce that, for all such z, 



Prob > z} < 2c C 



n— m+1 



A~AT(A,cr 2 I) V 2 



Using this tail bound to compute E(-Z) we get 



E(Z) = / Prob{Z > z) dz 




< e 2 + 2c 



e 2 <% = e 2 + 




"OO 




dz 



We can now conclude since 



E((l - A)/c(i4)) = E(eZ) = eE(2) < e 3 + 




< 20.1 



the inequality, again, by taking n large enough. 



□ 



4 Applications 

We next briefly discuss the two applications of our main result mentioned in the 
introduction. 

4.1 Accuracy of Linear Least Squares 

Recall the problem (1) described in the introduction, namely, to compute the mini- 
mum of \\Ax - b\\ 2 over x £ W l for given A G M mxn and b £ W 71 (with m > n). It 
is well known that the loss of precision LoP{A^b) — that is, the number of correct 
digits in the entries of the data (A, b) minus the same number for the computed 
solution At b— satisfies (cf. [26] and [14, Ch. 19]) 

LoP(A t 6) < logmn 3 / 2 + 2log k(A) + 0(1). 

Corollary 1.3, combined with Jensen's inequality, implies that E(log n(A)) < log(20.1/(l— 
A)) = 0(1) under the assumptions stated in the corollary. Hence for sufficiently 
elongated, large matrices A, the expected loss of precision in the computation of the 
solution A^b over all small perturbations A of A is dominated by the term log mn 3 ' 2 . 

4.2 Complexity of the Conjugate Gradient Method 

If P € W nxm is a symmetric positive definite matrix and c G M m , the system 
Px = c can be solved by the Conjugate Gradient Method (CGM), cf. [13]. This is 
an iterative algorithm which performs at most m iterations but may require less. 
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Indeed, it is known (see, e.g., [25, Lecture 38]) that an ^-approximation of the 
solution x can be computed in at most \ y/ n(P)\ lne| iterations (e measures the 
relative error of the approximation with respect to the Euclidean norm). 

In many cases the matrix P arises as P = AA T for some matrix A £ W nxn with 
n > m. If A is standard Gaussian distributed, then the resulting distribution of P, 
called Wishart distribution, has been extensively studied in multivariate statistics. 
However, in our case of interest, A is noncentered and much less is known about the 
resulting distribution of P (called noncentral Wishart). Fortunately, using the fact 
that \J 'k(P) = k(A), we can directly apply our tail bounds for k(A) for a noncentral, 
isotropic Gaussian distribution of A, to derive bounds for the expected number of 
iterations of CGM. 

To do so we use again Corollary 1.3. It shows that for all Ao £ (0, 1) and all 
< a < 1 there exists no such that for all 1 < m < n we have 

sup E (k(A)) < - -, 

||A||<1 A~N(A,aH) l ~ A 

provided A = < \ and n > uq. It follows that if P is obtained as AA^ for a 
large, elongated, rectangular matrix A then, we should expect to compute a solution 
with the desired accuracy with about |fzix| hie| iterations. It is known that each 
iteration of CGM takes 6n 2 + 0{n) arithmetic operations. Therefore, the expected 
cost of running CGM on P is 

2 20.1 ., m . . 60.3n 2 ln tn/ . 

3n 2 - -| lne| + 0(n) = r | lne| + 0(n). 

1 — A 1 — A 

The leading term in this expression is smaller than the |n 3 operations performed 
by Gaussian elimination as long as 

n(l-A) 

e > e si . 

For large n (and A not too close to 1) this bound produces very small values of e and 
therefore, CGM yields, on the average (both for a Wishart distribution of data P 
and for Wishart perturbations of arbitrary data) , remarkably good approximations 
of the solution x = P~ l c. 



5 Some Numerical Simulations 

Section 6 in [5] describes the result of numerical computations producing experi- 
mental values for E(ln k(^4)) (for certain choices of A and a), which are denoted by 
Avr(ln k(A)) and compared with the upper bound for E(ln«(^4)) 

u(m, n, a) := In [m + amybn + m 1 h 

V Jar 
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obtained there. The data in Tables 1 to 4 is taken from [5]. Each row in these tables 
corresponds to a pair (m,n). For each row, 500 random matrices A £ M mxn were 
computed following the distribution N(y/mA,T), where A was chosen as 

_ ones(m,n) 
||ones(m, n)|| ' 

and ones(m,n) denotes the m x n matrix all of whose entries are 1. The column 
with header Avr (In k(A)) shows the empirical average of In k(A) for the 500 chosen 
random matrices A. Since k(A) is scale invariant we note that this corresponds to 
random matrices chosen from N(A, aT), where a = \j\fm. 



m 


n 


Avr (]hk(A)) 


fj,(m, n, a) 


ln(20.1/(l - A)) 


10 


15 


1.88278226808667 


7.73190477060415 




20 


30 


2.04718612539162 


8.74083698937094 




40 


60 


2.13539482051851 


9.75820027818245 


4.0993321 


80 


120 


2.19377719811291 


10.78180469776403 




160 


240 


2.23119383890675 


11.80997066079053 





Table 1: n = 1.5m. 



m 


n 


Avr (In k(A)) 


fj,(m, n, a) 


ln(20.1/(l - A)) 


5 


10 


1.28204418194521 


6.35902343647518 




10 


20 


1.48669849397793 


7.36178009761038 




20 


40 


1.59394635398509 


8.37451330180407 




40 


80 


1.64896402420115 


9.39470162365532 


3.693866 


80 


160 


1.69565973841311 


10.42037692088400 




160 


320 


1.72154032592663 


11.45004561375610 





Table 2: n = 2m. 



m 


n 


Avr (In k(A)) 


fi(m, n, a) 


ln(20.1/(l - A)) 


10 


25 


1.24167342192086 


7.46370799208199 




20 


50 


1.34213347902230 


8.47908853717777 




40 


100 


1.40120155287858 


9.50123344342563 


3.511545 


80 


200 


1.44120596017225 


10.52833707967242 




160 


400 


1.45928497502137 


11.55903912197539 





Table 3: n = 2.5m. 
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m 


n 


Avr (In k(A)) 


H{m, n, a) 


m(20.1/(l — A)) 


5 


15 


0.98741849882614 


6.37209092337754 




10 


30 


1.10550395287499 


7.38102314214432 




20 


60 


1.18790345922560 


8.39838643095583 




40 


120 


1.23914387557043 


9.42199085053742 


3.406185 


80 


240 


1.27096561714092 


10.45015681356392 




160 


480 


1.28600775609989 


12.14829242876138 





Table 4: n = 3m. 



In [5] it is observed that "one sees that when one fixes m and lets n increase 
the quantity Avr(lnK(A)) decreases. This is in contrast with the behaviour of 
[i(m,n, a). It appears that our methods are not sharp enough to capture the be- 
haviour of E(ln k,(A)) ." Compare now with the results of the present paper. It 
follows from Corollary 1.3, by Jensen's inequality, that, for sufficiently large n, 
E(hiK(^4)) < In j^j. But if m is held fixed then, when n increases, A decreases and 
so does In 

One still observes a difference between the bound In and the values of 
Avr (In k(A)). Part of this difference comes from the asymptotic character of this 
bound and the fact that our data is limited to m < 160. One sees on the tables that 
larger values of m would approach Avr (In k(A)) to In We conjecture that, in 
addition to the possible loss of sharpness coming from the use of Jensen's inequality, 
the difference above is due to the roughness of the constant 20.1. 
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